Abstract

The aim of this paper is to provide a sound framework for address- 
ing a difficult problem: the automatic construction of an autonomous 
agent's modular architecture. We combine results from two appar- 
ently uncorrelated domains: Autonomous planning through Markov 
Decision Processes and a General Data Clustering Approach using a 
kernel-like method. Our fundamental idea is that the former is a good 
framework for addressing autonomy whereas the latter allows us to tackle
self-organizing problems. Indeed, we derive a modular self-organizing 
algorithm in which an autonomous agent learns to efficiently spread n 
planning problems over m initially blank modules with m < n. 

Introduction 

This paper addresses the problem of building a long-living autonomous 
agent; by long-living, we mean that this agent has a large number of rel- 
atively complex and varying tasks to perform. Biology suggests some ideas 
about the way animals deal with a variety of tasks: brains are made of spe- 
cialized and complementary areas/modules; skills are spread over modules. 
On the one hand, distributing functions and representations has immediate 
advantages: parallel processing implies reaction speed-up; a relative inde- 
pendence between modules gives more robustness. Both properties might 
clearly increase the agent's efficiency. On the other hand, distributing
a system raises a fundamental issue: how do the modules get organized
during the agent's lifetime?

There has been much research about the design of modular intelligent
architectures (see for instance [15] [5] [1] [7]). It is nevertheless very often 
the (human) designer who decides the way modules are connected to each 
other and how they behave with respect to the others. Few works study 
the construction of these modules. To our knowledge, there is no effective
work on modular self-organization except for reactive tasks
(stimulus-response associations) [6] [8] [3].

This paper proposes an architecture in which the partition in functional 
modules is automatically computed. The most significant aspect of our work 
is that the number m of modules is smaller than the number n of tasks to
be performed. Therefore, the approach we propose involves a high-level 
clustering process where the n tasks need to be "properly" spread over the 
m modules. 

Section 1 introduces what we consider as the theoretical foundation for 
modelling autonomy: Markov Decision Processes. Section 2 presents the 
state aggregation technique, which allows us to tackle difficult autonomous
problems, that is, large state space Markov Decision Processes. Section 3
describes the Kernel Clustering approach: it will stand as a theoretical
basis for addressing self-organization. In Section 4, Kernel Clustering will
lead to a generalization of the state aggregation technique, which we will
interpret as a modular self-organization procedure. Finally, Section 5 will present
empirical results about the self-organization of an autonomous agent that 
has to navigate in a continuous environment. 

1 Modelling a Mono-Task Autonomous Agent

Markov Decision processes [12] provide the theoretical foundations of chal- 
lenging problems such as planning under uncertainty and reinforcement 
learning [14]. They stand as a fundamental model for sequential decision
making and they have been applied to many real-world problems [13]. This
section describes this formalism and presents a general scheme for approaching
difficult problems (that is, problems in large domains).

1.1 Markov Decision Processes 

A Markov Decision Process (MDP) is a controlled stochastic process
satisfying the Markov property with rewards (numerical values) assigned to
state-control pairs (see footnote 1). Formally, an MDP is a four-tuple
$(S, A, T, R)$ where $S$ is the state space, $A$ is the action space, $T$ is
the transition function and $R$ is the reward function. $T$ is the
state-transition probability distribution conditioned by the control:

$$T(s, a, s') \stackrel{\text{def}}{=} \Pr(s_{t+1} = s' \mid s_t = s,\ a_t = a) \tag{1}$$

$R(s, a) \in \mathbb{R}$ is the instantaneous reward for taking action $a \in A$ in state $s$.

The usual MDP problem consists in finding an optimal policy, that is a
mapping $\pi : S \to A$ from states to actions, that maximises the following
performance criterion, also called value function of policy $\pi$:

$$V^{\pi}(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t \, R(s_t, \pi(s_t)) \,\middle|\, s_0 = s \right] \tag{2}$$

where $\gamma \in [0, 1)$ is the discount factor.



It is shown [12] that there exists a unique optimal value function V* which 
is the fixed point of the following contraction mapping B* (called Bellman 
operator): 



$$[B^*.f](s) = \max_{a} \left( R(s, a) + \gamma \sum_{s'} T(s, a, s') \, f(s') \right) \tag{3}$$



Once an optimal value function V* is computed, an optimal policy can 
immediately be derived as follows: 



$$\pi^*(s) = \arg\max_{a} \left( R(s, a) + \gamma \sum_{s'} T(s, a, s') \, V^*(s') \right) \tag{4}$$



Therefore, solving an MDP problem amounts to computing the optimal 
value function. Well-known algorithms for doing so are Value Iteration and 
Policy Iteration (see [12]). Their time complexity grows dramatically
with the number of states [9], so they can only be applied to relatively simple
problems. 
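
As an illustration of equations 3 and 4, here is a minimal Value Iteration
sketch in Python. The nested-list layout of T and R, the function names, the
stopping tolerance and the two-state example MDP are all assumptions made for
this sketch; they do not come from the paper.

```python
def value_iteration(n_states, n_actions, T, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman operator until the value function stops changing.

    T[s][a][s2] is a transition probability and R[s][a] a reward
    (plain nested lists; this layout is an assumption of the sketch).
    """
    V = [0.0] * n_states
    while True:
        # One application of the Bellman operator (equation 3).
        new_V = [
            max(
                R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_states))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            return V


def greedy_policy(n_states, n_actions, T, R, V, gamma=0.9):
    """Derive a policy from a value function (equation 4)."""
    def q(s, a):
        return R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_states))
    return [max(range(n_actions), key=lambda a: q(s, a)) for s in range(n_states)]


# Hypothetical 2-state, 2-action MDP: action 0 stays put, action 1 switches
# state. Only staying in state 1 is rewarded.
T = [
    [[1.0, 0.0], [0.0, 1.0]],  # transitions from state 0
    [[0.0, 1.0], [1.0, 0.0]],  # transitions from state 1
]
R = [[0.0, 0.0], [1.0, 0.0]]

V = value_iteration(2, 2, T, R)             # close to [9.0, 10.0]
policy = greedy_policy(2, 2, T, R, V)       # [1, 0]: go to state 1, then stay
```

Each sweep costs a number of operations proportional to the number of
actions times the square of the number of states, which is why such exact
methods only suit small state spaces.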



1.2 Addressing a Large State Space MDP 

In very large domains, it is impossible to solve an MDP exactly, so we usually
settle for a compromise between complexity and quality. Ideally, an
approximate scheme for MDPs should consist of a set of tractable algorithms for

• computing an approximate optimal value function

• evaluating (an upper bound of) the approximation error

• improving the quality of approximation (by reducing the approximation
error) while constraining the complexity.

Footnote 1: Though our definition of reward is a bit restrictive (rewards are
sometimes assigned to state transitions), it is not a limitation: these two
definitions are equivalent.

The first two points are the fundamental theoretical bases for sound
approximation. The third one is often interpreted as a learning process and
corresponds to what most Machine Learning researchers study. For convenience,
we respectively call these three procedures Approximate(), Error() and
Learn(). The use of such an approximate scheme is sketched by algorithm 1:
one successively applies the Learn() procedure in order to minimize the
approximation error; when this is done, one can compute a good approximate
value function. The next section describes an example of such a set of
procedures for practically approximating a large state space MDP.

Algorithm 1 A general approximation scheme for a large state space MDP

Input: a large state space MDP $\mathcal{M}$ and an initial approximation $\widehat{\mathcal{M}}$.
Output: a good approximate value function $\widehat{V}^*$.

while Error($\widehat{\mathcal{M}}$, $\mathcal{M}$) goes on diminishing do
    $\widehat{\mathcal{M}}$ <- Learn($\widehat{\mathcal{M}}$, $\mathcal{M}$)
end while
$\widehat{V}^*$ <- Approximate($\widehat{\mathcal{M}}$)

2 The State Aggregation Approximation 

This section reviews an example of approximation scheme for solving large 
state space MDPs. The class of approximations we consider is the state 
aggregation approximation, that is approximate models in which whole sets 
of states are treated as if they had the same parameters and underlying 
values. 

Given an MDP $\mathcal{M} = (S, A, T, R)$, the state aggregation approximation
formally consists in introducing the MDP
$\widehat{\mathcal{M}} = (\widehat{S}, A, \widehat{T}, \widehat{R})$ where the state
space $\widehat{S}$ is a partition of the real state space $S$. Every element of
$\widehat{S}$, which we call macro-state, is a subset of $S$ and every element
of $S$ belongs to one and only one macro-state. Conversely, every object
defined on $\widehat{S}$ can be seen as an object of $S$ which is constant on
every macro-state. The number of elements of $\widehat{S}$ can be chosen small
enough so that it is feasible to compute the approximate value function of
the approximation $\widehat{\mathcal{M}}$.

Using some recent results by [10], we are going to describe how the
procedures Approximate(), Error() and Learn() (introduced in the previous
section) can be defined.



2.1 Computing an Approximate Solution 

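When doing a state aggregation approximation, natural choices for the
approximate parameters $\widehat{R}$ and $\widehat{T}$ are the averages of the
real parameters on each macro-state:

$$\widehat{R}(\widehat{s}, a) = \frac{1}{|\widehat{s}|} \sum_{s \in \widehat{s}} R(s, a)$$
$$\widehat{T}(\widehat{s}_1, a, \widehat{s}_2) = \frac{1}{|\widehat{s}_1|} \sum_{(s, s') \in \widehat{s}_1 \times \widehat{s}_2} T(s, a, s') \tag{5}$$

From these, an approximate value function $\widehat{V}^*$ can be computed: it is
the fixed point of the approximate Bellman operator $\widehat{B}^*$ (defined on
$\widehat{S}$):

$$[\widehat{B}^*.f](\widehat{s}) = \max_{a} \left( \widehat{R}(\widehat{s}, a) + \gamma \sum_{\widehat{s}'} \widehat{T}(\widehat{s}, a, \widehat{s}') \, f(\widehat{s}') \right) \tag{6}$$

This constitutes the Approximate() procedure in the state aggregation
approach.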
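
The averaging of equation 5 and the resulting approximate value function can
be sketched as follows. This is only an illustration: the four-state chain,
the partition into two macro-states, the nested-list layout and all function
names are hypothetical choices made here, not taken from the paper.

```python
GAMMA = 0.9


def solve(T, R, gamma=GAMMA, tol=1e-10):
    """Value iteration; T[s][a][s2] and R[s][a] are nested lists."""
    n = len(T)
    V = [0.0] * n
    while True:
        new_V = [
            max(
                R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n))
                for a in range(len(R[s]))
            )
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def aggregate(partition, T, R):
    """Average the exact parameters over each macro-state (equation 5).

    `partition` is a list of macro-states, each a list of state indices.
    """
    n_actions = len(R[0])
    R_hat = [
        [sum(R[s][a] for s in block) / len(block) for a in range(n_actions)]
        for block in partition
    ]
    T_hat = [
        [
            [sum(T[s][a][s2] for s in b1 for s2 in b2) / len(b1) for b2 in partition]
            for a in range(n_actions)
        ]
        for b1 in partition
    ]
    return T_hat, R_hat


# Hypothetical chain of 4 states; action 0 moves left, action 1 moves right,
# and being in the last state pays +1.
n = 4


def step(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, n - 1)


T = [[[1.0 if s2 == step(s, a) else 0.0 for s2 in range(n)] for a in range(2)]
     for s in range(n)]
R = [[1.0 if s == n - 1 else 0.0 for _ in range(2)] for s in range(n)]

partition = [[0, 1], [2, 3]]          # two macro-states
T_hat, R_hat = aggregate(partition, T, R)
V_exact = solve(T, R)                 # close to [7.29, 8.1, 9.0, 10.0]
V_hat = solve(T_hat, R_hat)           # close to [4.09, 5.0]
```

On this toy problem the coarse partition halves the value of the rewarded
region, which shows why the approximation error has to be measured and the
partition refined.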



2.2 Bounding the Approximation Error 

Let $B^*$ be the exact Bellman operator of $\mathcal{M}$ (see eq. 3). Let $V^*$
be the real value function. In practice, we would like to evaluate how much
the approximate value function $\widehat{V}^*$ differs from the real value
function $V^*$, i.e. we want to compute the approximation error on each
macro-state $\widehat{s}$:

$$E_{app}(\widehat{s}) \stackrel{\text{def}}{=} \max_{s \in \widehat{s}} |V^*(s) - \widehat{V}^*(s)| \tag{7}$$

The authors of [10] show that the approximation error depends on a quantity
they call interpolation error which is easier to evaluate:

$$E_{int}(\widehat{s}) \stackrel{\text{def}}{=} \max_{s \in \widehat{s}} |B^*.V^*(s) - \widehat{B}^*.V^*(s)| \tag{8}$$

The interpolation error is the error due to one approximate mapping
$\widehat{B}^*$ of the real value function $V^*$; it measures how the
approximate parameters $(\widehat{R}(s, a), \widehat{T}(s, a, .))$ locally differ
from the real parameters $(R(s, a), T(s, a, .))$. Indeed, it can be shown that
for some constant $K$

$$E_{int}(\widehat{s}) \le \max_{a \in A,\, s \in \widehat{s}} |R(s, a) - \widehat{R}(s, a)| + K \max_{a \in A,\, s \in \widehat{s}} \sum_{s'} |T(s, a, s') - \widehat{T}(s, a, s')| \tag{9}$$



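We can deduce from equations 5 and 9 an upper bound
$\widehat{E}_{int}(\widehat{s}_1)$ of the interpolation error on the macro-state
$\widehat{s}_1$:

$$\widehat{E}_{int}(\widehat{s}_1) = \Delta R(\widehat{s}_1) + K \sum_{\widehat{s}_2} \Delta T(\widehat{s}_1, \widehat{s}_2) \tag{10}$$

with

$$\Delta R(\widehat{s}) = \frac{|\widehat{s}| - 1}{|\widehat{s}|} \max_{a \in A,\, (s, s') \in \widehat{s}^2} |R(s, a) - R(s', a)|$$

and

$$\Delta T(\widehat{s}_1, \widehat{s}_2) = \frac{|\widehat{s}_1| - 1}{|\widehat{s}_1|} \max_{a \in A,\, (s_1, s_1') \in \widehat{s}_1^2} \left| \sum_{s_2 \in \widehat{s}_2} T(s_1, a, s_2) - T(s_1', a, s_2) \right|$$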

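Once we have an upper bound of the interpolation error, the authors of
[10] show that an upper bound $\widehat{E}_{app}(\widehat{s})$ of
$E_{app}(\widehat{s})$ is the fixed point of the following contraction mapping:

$$[E.f](\widehat{s}_1) = \widehat{E}_{int}(\widehat{s}_1) + \max_{a} \left( \gamma \sum_{\widehat{s}_2} \widehat{T}(\widehat{s}_1, a, \widehat{s}_2) \, f(\widehat{s}_2) \right) \tag{11}$$

We thus have an Error() procedure.

2.3 Improving the Approximation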

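Finally, this subsection explains how one might improve a state aggregation
approximation by iteratively updating the partition $\widehat{S}$.

The authors of [10] introduce the notion of influence $I_{\Omega}(\widehat{s})$
of the interpolation error at macro-state $\widehat{s}$ on the approximation
error over a subset $\Omega \subset \widehat{S}$:

$$I_{\Omega}(\widehat{s}) = \frac{\partial \sum_{\widehat{s}' \in \Omega} \widehat{E}_{app}(\widehat{s}')}{\partial \widehat{E}_{int}(\widehat{s})} \tag{12}$$

They prove that the influence $I_{\Omega}$ is the fixed point of the following
contraction mapping:

$$[D.f](\widehat{s}) = \begin{cases} 1 & \text{if } \widehat{s} \in \Omega \\ 0 & \text{otherwise} \end{cases} + \gamma \sum_{\widehat{s}'} \widehat{T}(\widehat{s}', \pi_{err}(\widehat{s}'), \widehat{s}) \, f(\widehat{s}') \tag{13}$$

where $\pi_{err}(\widehat{s}) = \arg\max_{a} \sum_{\widehat{s}'} \widehat{T}(\widehat{s}, a, \widehat{s}') \, \widehat{E}_{app}(\widehat{s}')$
(see [10] for more details).

Say we update the partition for some macro-state $\widehat{s}$ (e.g. we divide
$\widehat{s}$ in two new macro-states). The interpolation error bound will
change by $\Delta \widehat{E}_{int}(\widehat{s})$ and a gradient argument shows the
effect this will have on the approximation error:

$$\Delta \left( \sum_{\widehat{s}' \in \Omega} \widehat{E}_{app}(\widehat{s}') \right) \simeq I_{\Omega}(\widehat{s}) \, \Delta \widehat{E}_{int}(\widehat{s}) \tag{14}$$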

Using this analysis, we are able to predict the effect that locally refining
(or coarsening) the partition $\widehat{S}$ has on the approximation error we
want to minimize. This allows us to efficiently and dynamically balance the
resources of the approximation over the state space.

This constitutes a Learn() procedure for the state aggregation
approximation. Experimental demonstrations of a similar approach can be found
in [11].

3 Kernel Clustering 

So far, we have recalled recent results for approximating a single large state
space MDP. When trying to model a long-living autonomous agent, it is more
realistic to consider that it does not only have one problem (one MDP) to
solve but rather many (if not infinitely many):
$(\mathcal{M}_i)_{1 \le i \le n} = ((S, A, T_i, R_i))_{1 \le i \le n}$.
In order to address such a case, we first need to present the Kernel Clustering
paradigm. This is what we do in the remainder of this section.

3.1 Definitions 

In [2], the author introduces an abstract generalization of vector
quantization, which he calls Kernel Clustering. Indeed, the author argues
that, in general, a clustering problem is based on three elements:

• $(x_i)_{i \in I}$: a set of data points taken from a data space $\mathcal{X}$

• $\{L_1, .., L_m\}$: a set of kernels taken from a kernel space $\mathcal{L}$

• $d : \mathcal{X} \times \mathcal{L} \to \mathbb{R}^+$: a distance measure between
any data point and any kernel. The smaller the distance $d(x, L)$, the more
$L$ is representative of the point $x$.

Given a set of kernels $\{L_1, .., L_m\}$, a data point $x$ is naturally
associated to its most representative kernel $L(x)$, i.e. the one that is the
closest according to distance $d$:

$$L(x) = \arg\min_{L \in \{L_1, .., L_m\}} d(x, L) \tag{15}$$

Conversely, a set of kernels $\{L_1, .., L_m\}$ naturally induces a partition of
the data set $(x_i)_{i \in I}$ into $m$ classes $\{C_1, .., C_m\}$, each class
corresponding to a kernel:

$$\forall j \in \{1, .., m\}, \quad C_j = \{x_i,\ i \in I : L(x_i) = L_j\} \tag{16}$$

Given a data space, a data set, a kernel space and a distance $d$, the
goal of the Kernel Clustering problem is to find the set of kernels
$\{L_1, .., L_m\}$ that minimizes the distortion $D$ for the data set
$(x_i)_{i \in I}$, which is defined as follows:

$$D = \sum_{i \in I} d(x_i, L(x_i)) = \sum_{j=1}^{m} \sum_{x \in C_j} d(x, L_j) \tag{17}$$

In other words, solving a clustering problem consists in finding the kernels 
that are the most representative of the data set. 

For instance, the well-known vector quantization problem is a particular 
case of Kernel Clustering where 

• the kernel space $\mathcal{L}$ and the data space $\mathcal{X}$ are both $\mathbb{R}^n$

• the distance $d(x, L)$ is the Euclidean norm $\|x - L\|$.

As we will see in the next sections, the power and the richness of the Kernel 
Clustering approach over simple vector quantization comes essentially from 
the fact that kernels and data need not be in the same space. 
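
To make equations 15 to 17 concrete, here is a small worked example in the
vector quantization case. The four data points and the two candidate kernel
sets are made up for this illustration; data and kernels live in
$\mathbb{R}$ and $d$ is the absolute difference.

```latex
% Data set: x_1 = 0, x_2 = 1, x_3 = 5, x_4 = 6.
% Kernel set A: L_1 = 0.5, L_2 = 5.5.
% Equation 15 gives L(x_1) = L(x_2) = L_1 and L(x_3) = L(x_4) = L_2,
% so C_1 = \{0, 1\} and C_2 = \{5, 6\} (equation 16), and by equation 17:
D_A = |0 - 0.5| + |1 - 0.5| + |5 - 5.5| + |6 - 5.5| = 2
% Kernel set B: L_1 = 1, L_2 = 3.
% Now L(x_1) = L(x_2) = L_1 and L(x_3) = L(x_4) = L_2 again, but
D_B = |0 - 1| + |1 - 1| + |5 - 3| + |6 - 3| = 6
% Kernel set A is therefore the more representative of the two.
```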

3.2 The Dynamic Cluster Algorithm 

An interesting observation about the Kernel Clustering approach is the
following fact: the Dynamic Cluster algorithm [2] (see algorithm 2) for
(suboptimally) optimizing the set of kernels is the exact generalization of
the batch k-means algorithm, which (suboptimally) solves the vector
quantization problem (see [4] and [2]). This is an iterative process which
consists of two complementary steps:

• Given a partition $\{C_1, .., C_m\}$, find the best corresponding kernels
$\{L_1, .., L_m\}$

• Given a set of kernels $\{L_1, .., L_m\}$, deduce the corresponding partition
$\{C_1, .., C_m\}$.

Algorithm 2 The Dynamic Cluster algorithm

Input: A data set $(x_i)_{i \in I}$
Output: A set of kernels $\{L_1, .., L_m\}$ that optimizes the clustering (i.e.
that minimizes the distortion)
Initialization:
    Let $\{C_1, .., C_m\}$ be any partition of the data set
Iterations:
    repeat
        1. Find the best set of kernels corresponding to the partition
           $\{C_1, .., C_m\}$:
        for j from 1 to m do
            $L_j \leftarrow \arg\min_{L \in \mathcal{L}} \sum_{x \in C_j} d(x, L)$
        end for
        2. Find the partition $\{C_1, .., C_m\}$ corresponding to the kernels
           $\{L_1, .., L_m\}$:
        for j from 1 to m do
            $C_j \leftarrow \{x_i,\ i \in I : L(x_i) = L_j\}$
        end for
    until there is no more change in the partition $\{C_1, .., C_m\}$
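
The two steps of algorithm 2 can be sketched in Python for the vector
quantization case. Two assumptions are made here that differ from the text:
the distance is the squared Euclidean distance, so that the best kernel of a
class is simply its mean (this is what batch k-means does), and the procedure
is initialised from kernels rather than from a partition. The function name
and the sample points are hypothetical.

```python
def batch_dynamic_cluster(points, kernels, max_iters=100):
    """Batch Dynamic Cluster for squared Euclidean distance, i.e. k-means.

    points and kernels are lists of equal-length tuples of floats.
    """
    def dist(x, L):
        return sum((xi - li) ** 2 for xi, li in zip(x, L))

    def assign(current):
        # Equation 15: index of the most representative kernel of each point.
        return [min(range(len(current)), key=lambda j: dist(x, current[j]))
                for x in points]

    labels = assign(kernels)
    for _ in range(max_iters):
        # Step 1: best kernel of each class; for squared distance, the mean.
        new_kernels = []
        for j, old in enumerate(kernels):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:
                new_kernels.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_kernels.append(old)  # an empty class keeps its kernel
        kernels = new_kernels
        # Step 2: partition induced by the new kernels (equation 16).
        new_labels = assign(kernels)
        if new_labels == labels:
            break  # no more change in the partition
        labels = new_labels
    # Distortion of the final clustering (equation 17).
    distortion = sum(dist(x, kernels[lab]) for x, lab in zip(points, labels))
    return kernels, labels, distortion


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
kernels, labels, distortion = batch_dynamic_cluster(points, [(0.0, 0.0), (5.0, 5.0)])
# kernels == [(0.0, 0.5), (4.5, 4.0)], labels == [0, 0, 1, 1], distortion == 1.0
```

With another distance, only the kernel-update step changes, and it may no
longer have a closed form.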

While the second step (deducing the partition) is straightforward (one just
applies equation 16), the first (finding the best kernels) is itself an
optimization problem which can be very difficult. In general, it might be
easier to use an on-line version of the Dynamic Cluster algorithm (see
algorithm 3), just as it is often easier to use an on-line version of the
k-means algorithm. The resulting algorithm becomes simple and intuitive: for
each piece of data x, one finds its most representative kernel L, and one
updates L so that it gets even more representative of x. Little by little,
one might expect that such a procedure will minimize the global distortion
and eventually give a good clustering.

Algorithm 3 The on-line Dynamic Cluster algorithm

Input: A data set
Output: A set of kernels $\{L_1, .., L_m\}$ that optimizes the clustering (i.e.
that minimizes the distortion)
Initialization:
    Let $\{L_1, .., L_m\}$ be any set of kernels
Iterations:
    while the distortion goes on diminishing do
        Randomly pick a data point x from the data set
        Find the most representative kernel of x:
        $L \leftarrow L(x) = \arg\min_{L' \in \{L_1, .., L_m\}} d(x, L')$
        Update L so that d(x, L) diminishes
    end while
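
A minimal Python sketch of algorithm 3 in the vector quantization case
follows. Several choices are assumptions of this sketch rather than part of
the algorithm as stated: the update moves the winning kernel a fraction lr of
the way towards the data point, the loop runs for a fixed number of steps
instead of monitoring the distortion, and the function names and sample
points are hypothetical.

```python
import random


def dist(x, L):
    """Squared Euclidean distance between a data point and a kernel."""
    return sum((xi - li) ** 2 for xi, li in zip(x, L))


def nearest(x, kernels):
    """Index of the most representative kernel of x (equation 15)."""
    return min(range(len(kernels)), key=lambda j: dist(x, kernels[j]))


def online_dynamic_cluster(points, kernels, n_steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    kernels = [list(L) for L in kernels]  # work on a copy
    for _ in range(n_steps):
        x = rng.choice(points)            # randomly pick a data point
        j = nearest(x, kernels)           # find its most representative kernel
        # Move that kernel towards x, so that d(x, L) diminishes.
        kernels[j] = [l + lr * (xi - l) for xi, l in zip(x, kernels[j])]
    return kernels


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
kernels = online_dynamic_cluster(points, [(0.0, 0.0), (5.0, 5.0)])
labels = [nearest(x, kernels) for x in points]
```

Unlike the batch version, each kernel keeps oscillating between the points of
its class; a decreasing step size would be one way to make it settle.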



4 Modular Self-Organization for a Multi-Task Autonomous Agent

This section is going to show how the (apparently uncorrelated) Kernel 
Clustering paradigm can be used to formalize a modular self-organization 
problem in the MDP framework, the algorithmic solution of which will be 
given by the on-line Dynamic Cluster procedure (algorithm 3). 

If one carefully compares the general learning scheme we have described 
in order to address a large state space MDP (algorithm 1) and the on-line 
Dynamic Cluster procedure (algorithm 3), one can see that the former is a 
specific case of the latter. More precisely, algorithm 1 solves a simple Kernel 
Clustering problem where 

• the data space is the space of all possible MDPs and the data set is a
unique task corresponding to an MDP $\mathcal{M}$

• the kernel space is the space of all possible approximations and there
is one and only one kernel: $\widehat{\mathcal{M}}$

• the distance $d$ is the Error() function.

This observation suggests the following parallel between Kernel
Clustering and MDPs:

Kernel Clustering     Markov Decision Processes
Data space            Space of all possible MDPs
Data set              A set of tasks
Kernel space          Space of approximate models
Kernel                An approximate model
Distance              Approximation error

The transposition of the on-line Dynamic Cluster procedure into the MDP
framework (algorithm 4) therefore allows us to tackle a difficult problem:
finding a small set of approximate models that globally minimize the
approximation error for a large set of MDPs.

Algorithm 4 Modular Self-Organization

Input: A set of MDPs $(\mathcal{M}_i)_{i \in I}$
Output: A set of approximate models
$(\widehat{\mathcal{M}}_1, .., \widehat{\mathcal{M}}_m)$ that globally minimizes
the approximation error
Initialization:
    Let $(\widehat{\mathcal{M}}_1, .., \widehat{\mathcal{M}}_m)$ be any set of approximate models
Iterations:
    while the global approximation error goes on diminishing do
        Randomly pick a task $\mathcal{M}$ from the set of MDPs
        Find the best module for solving $\mathcal{M}$:
        $\widehat{\mathcal{M}} \leftarrow \arg\min_{\widehat{\mathcal{M}}' \in \{\widehat{\mathcal{M}}_1, .., \widehat{\mathcal{M}}_m\}}$ Error($\widehat{\mathcal{M}}'$, $\mathcal{M}$)
        $\widehat{\mathcal{M}} \leftarrow$ Learn($\widehat{\mathcal{M}}$, $\mathcal{M}$)
    end while

The result of such an approach can really be seen as a modular architecture.
Indeed, every time a task (even a new task) is given to such a system, all
kernels/modules can compute their approximation error and the best module
for solving the task is the module that makes the minimal error.
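
The loop of algorithm 4 can be sketched on a toy problem. Everything specific
below is an assumption made for illustration and is much cruder than the
procedures of section 2: tasks are MDPs on a hypothetical 8-state chain that
differ only by their goal state, a module is a partition of the chain into
three contiguous macro-states described by two cut points, error() computes
the exact gap between $V^*$ and $\widehat{V}^*$ by brute force instead of the
bound of [10], learn() greedily shifts one cut instead of using the influence,
and the loop runs for a fixed number of steps.

```python
import random

GAMMA = 0.9
N = 8  # states of a hypothetical 1-D chain; actions: 0 = left, 1 = right


def solve(T, R, tol=1e-9):
    """Value iteration for an MDP given as nested lists T[s][a][s2], R[s][a]."""
    n = len(T)
    V = [0.0] * n
    while True:
        new_V = [max(R[s][a] + GAMMA * sum(T[s][a][s2] * V[s2] for s2 in range(n))
                     for a in range(len(R[s]))) for s in range(n)]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def make_task(goal):
    """An MDP on the chain that pays +1 in state `goal`, with its exact values."""
    def step(s, a):
        return max(s - 1, 0) if a == 0 else min(s + 1, N - 1)
    T = [[[1.0 if s2 == step(s, a) else 0.0 for s2 in range(N)]
          for a in range(2)] for s in range(N)]
    R = [[1.0 if s == goal else 0.0 for _ in range(2)] for s in range(N)]
    return {"T": T, "R": R, "V": solve(T, R)}


def blocks(cuts):
    """Turn increasing cut points into contiguous macro-states."""
    edges = [0] + list(cuts) + [N]
    return [list(range(lo, hi)) for lo, hi in zip(edges, edges[1:])]


def error(cuts, task):
    """Stand-in for Error(): exact max |V* - V_hat*| under state aggregation."""
    part = blocks(cuts)
    T, R = task["T"], task["R"]
    R_hat = [[sum(R[s][a] for s in b) / len(b) for a in range(2)] for b in part]
    T_hat = [[[sum(T[s][a][s2] for s in b1 for s2 in b2) / len(b1) for b2 in part]
              for a in range(2)] for b1 in part]
    V_hat = solve(T_hat, R_hat)
    return max(abs(task["V"][s] - V_hat[i]) for i, b in enumerate(part) for s in b)


def learn(cuts, task):
    """Stand-in for Learn(): shift one cut by one state if that lowers the error."""
    best, best_err = cuts, error(cuts, task)
    for i in range(len(cuts)):
        for d in (-1, 1):
            cand = list(cuts)
            cand[i] += d
            cand = tuple(cand)
            valid = (all(0 < c < N for c in cand)
                     and all(a < b for a, b in zip(cand, cand[1:])))
            if valid:
                cand_err = error(cand, task)
                if cand_err < best_err:
                    best, best_err = cand, cand_err
    return best


def self_organize(tasks, modules, n_steps=60, seed=0):
    """Algorithm 4 with a fixed step budget instead of an error-based stop."""
    rng = random.Random(seed)
    modules = list(modules)
    for _ in range(n_steps):
        task = rng.choice(tasks)                                  # pick a task
        j = min(range(len(modules)), key=lambda k: error(modules[k], task))
        modules[j] = learn(modules[j], task)                      # train the winner
    return modules


tasks = [make_task(g) for g in (0, 1, 6, 7)]        # n = 4 tasks
modules = self_organize(tasks, [(2, 4), (4, 6)])    # m = 2 modules
# Dispatching a task: the module with the smallest error handles it.
assignment = [min(range(len(modules)), key=lambda k: error(modules[k], t))
              for t in tasks]
```

The dispatch line at the end is the modular behaviour described above: for
any task, each module evaluates its own error and the smallest one wins.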

5 An Experiment of Modular Self-Organization 

This final section provides an illustration of the Modular Self-Organization
algorithm (algorithm 4) where the number of tasks n equals 6 and the number
of modules m is 3. We illustrate our approach on a navigation problem,
although our self-organization algorithm is not limited to a navigation
context: it can theoretically be applied to any problem which can be
formulated in the MDP framework. An agent has to find its way in a continuous
environment. This environment consists of 2 rooms and 2 corridors (see
figure 1). The set of states is the continuous set of positions
$(x, y) \in (0; 10)^2$ in the environment. The actions are the 8 cardinal
moves (amplitude 0.1), whose effects are slightly corrupted with noise
(amplitude 0.03 and random direction). Six areas, denoted as circles in
figure 1, are possible goals. One notifies an agent that it has reached a goal
by giving it a strictly positive reward (+1). One also gives a negative
reinforcement (-1) when the agent hits a wall. All the other situations have
a zero reward. Note that when an agent acts optimally in such a task, it
only receives a reward when it reaches the goal.

Figure 1: A Multi-Task Environment with six goal zones

We use the six goal areas in order to define six MDPs/tasks. Each of 
these tasks involves going from one zone to another. The following table 
sums them up: 



MDP                 Start     Goal
$\mathcal{M}_1$     zone 2    zone 1
$\mathcal{M}_2$     zone 3    zone 2
$\mathcal{M}_3$     zone 4    zone 3
$\mathcal{M}_4$     zone 5    zone 4
$\mathcal{M}_5$     zone 6    zone 5
$\mathcal{M}_6$     zone 1    zone 6


We have applied the Modular Self-Organization procedure (algorithm 4)
with 3 kernels/modules, and with the Error() and Learn() functions
described in section 2. Figure 2 shows that the performances (obtained with
500 simulated runs for each single task) of the system grow for the six tasks.
Figure 3 shows that the clustering (i.e. the spreading of expertise over
the 3 modules) eventually stabilizes to an interesting clustering: each of the
final modules deals with two tasks. Finally, we see in figure 4 the state
aggregations of the resulting 3 modules: we observe that a module tends to
describe precisely the goal zones of its two automatically associated tasks.

Figure 2: Performance evolution during the Modular Self-Organization
algorithm: for each of the six MDPs (and for the average cumulative rewards
for all six), we see that the system performance is monotonically increasing.



6 Conclusion 

In this paper, we have reviewed some recent results for sound approximation
in large state space Markov Decision Processes and shown how they can
be applied to the state aggregation scheme. We have then shown how
such results can be extended to an interesting problem: the modular
self-organization of an autonomous agent. We have formalized the problem of
modular self-organization as a clustering problem in the space of MDPs. We
solved it using an on-line version of the Dynamic Cluster algorithm. Finally,
we have tested this approach in a continuous navigation framework,
where a 3-module agent has to address 6 tasks.

In future work, we will try to extend this general approach to more powerful
approximation schemes than the state aggregation approach (which
suffers from the curse of dimensionality). Furthermore, we will investigate
its possible use in reinforcement learning, where the parameters of an MDP
have to be obtained by experience.

Figure 3: Evolution of the clustering process: at each iteration, each task
is naturally associated to one of the three modules (the one that makes the
minimal error); this spreading eventually stabilizes: at the end, module 1
deals with $\mathcal{M}_2$ and $\mathcal{M}_3$, module 2 deals with
$\mathcal{M}_4$ and $\mathcal{M}_5$, and module 3 deals with $\mathcal{M}_1$
and $\mathcal{M}_6$.